SenseSmart¶

Jerry Zhao, Xuying Yang, Yujuan Zhou, Clement Mo, Haozheng Liu


Project Name: SenseSmart¶

Dataset: Social Media Sentiment Analysis Dataset

This project aims to analyze user-generated content across various social media platforms to uncover sentiment trends and user behavior. The dataset offers a rich source of data, including text-based content, user sentiments, timestamps, hashtags, user engagement metrics (likes and retweets), and geographical information. By exploring this data, we can identify how emotions fluctuate over time, platform, and geography. We will also investigate the correlation between popular content and user engagement metrics.

Problem Statement: The primary goal is to perform sentiment analysis, investigate temporal and geographical trends in user-generated content, and analyze platform-specific user behavior. The project will focus on identifying popular topics through hashtags, exploring engagement levels, and understanding regional differences in sentiment trends.

Tasks:

  • Dataset Exploration:
    • Gain familiarity with the dataset by understanding its structure and key features such as sentiment, timestamps, and user engagement (likes and retweets).
  • Sentiment Analysis:
    • Conduct sentiment analysis to classify the user-generated content into different categories such as surprise, excitement, admiration, etc.
    • Visualize the distribution of sentiments and examine the emotional landscape of social media platforms.
  • Temporal Analysis:
    • Explore temporal patterns in user sentiment over time using the "Timestamp" column.
    • Identify recurring themes, seasonal variations, or any significant trends in the data.
  • User Engagement Insights:
    • Analyze user engagement by studying the likes and retweets associated with posts.
    • Investigate how sentiment correlates with higher levels of user engagement.
  • Platform-Specific Analysis:
    • Compare sentiment trends across various platforms using the "Platform" column.
    • Identify how emotions differ depending on the platform.
  • Hashtag and Topic Trends:
    • Explore trending topics by analyzing the hashtags.
    • Investigate the relationship between hashtags and user engagement or sentiment.
  • Geographical Trends:
    • Examine regional sentiment variations using the "Country" column.
    • Understand how social media content and sentiment differ across various regions.
  • Cross-Feature Analysis:
    • Combine features (e.g., sentiment and hashtags, sentiment and platform) to uncover deeper insights about user behavior and content trends.
  • Predictive Modeling (Optional):
    • Explore the possibility of building predictive models to predict user engagement (likes/retweets) based on sentiment, hashtags, and platform.
    • Evaluate the performance of the model and explore its potential for predicting popular content.

Students are encouraged to draw connections between data-driven insights and potential policy implications. The project should foster a deeper understanding of sentiment dynamics across social media platforms and their impact on public discourse.
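Several of the tasks above (platform-specific analysis, cross-feature analysis) reduce to counting posts per (Platform, Sentiment) pair. A minimal pandas sketch, using hypothetical toy rows in place of the real dataset:

```python
import pandas as pd

# Toy rows standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    'Platform':  ['Twitter', 'Twitter', 'Instagram', 'Facebook'],
    'Sentiment': ['Positive', 'Negative', 'Positive', 'Neutral'],
})

# Count posts per platform/sentiment pair; missing combinations become 0.
tbl = pd.crosstab(toy['Platform'], toy['Sentiment'])
print(tbl)
```

The analysis cells below build the same kind of table with `groupby(...).size().unstack(fill_value=0)`; `pd.crosstab` is an equivalent shorthand.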

Import Libraries¶

In [1]:
# System Utilities
import os
import sys
import re
import math
import random
import colorsys
import shutil
import zipfile
import subprocess
from pathlib import Path

# Data Handling
import pandas as pd
import numpy as np
import warnings

# Visualization Libraries
import matplotlib.pyplot as plt
from matplotlib import font_manager, rcParams
import plotly.express as px

# WordCloud
from PIL import Image, ImageFont
from wordcloud import WordCloud, STOPWORDS
import jieba
from collections import Counter

# Other
from IPython.display import display # DataFrame
import kagglehub # Download KaggleHub dataset
In [2]:
import os
import sys
import re
import math
import random
import colorsys
import shutil
import zipfile
import subprocess
from pathlib import Path
import warnings

# Data Handling
import pandas as pd
import numpy as np
from IPython.display import display # For DataFrame display

# Visualization
import matplotlib.pyplot as plt
from matplotlib import font_manager, rcParams
import plotly.express as px # Plotly
import plotly.graph_objects as go

# WordCloud
from PIL import Image, ImageFont
from wordcloud import WordCloud, STOPWORDS
import jieba
from collections import Counter

# Machine Learning
from sklearn.model_selection import train_test_split, RandomizedSearchCV

# Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer

# Classification Models
from sklearn.linear_model import PassiveAggressiveClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

# Metrics
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix
)

import kagglehub # To download dataset from KaggleHub

Dataset acquisition¶

Dataset: Social Media Sentiment Analysis Dataset

In [3]:
# Install needed packages
%pip -q install kagglehub pandas matplotlib scikit-learn nltk
Note: you may need to restart the kernel to use updated packages.
In [4]:
# Download the latest version of the dataset
# path = kagglehub.dataset_download("kashishparmar02/social-media-sentiments-analysis-dataset")
cache = Path(kagglehub.dataset_download("kashishparmar02/social-media-sentiments-analysis-dataset"))
print("KaggleHub cache:", cache)

# Prepare and clear ./data folder
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
for p in data_dir.iterdir():
    if p.is_file():
        p.unlink() # remove file
    else:
        shutil.rmtree(p)

# Collect .csv files
csv_found = []
for f in cache.rglob("*.csv"):
    dst = data_dir / f.name
    if not dst.exists(): # Avoid duplicates
        shutil.copy2(f, dst)
        csv_found.append(dst.name)

# If none found, scan all zips and extract ONLY .csv files into ./data
if not csv_found:
    for z in cache.rglob("*.zip"):
        try:
            with zipfile.ZipFile(z) as zf:
                for member in zf.infolist():
                    # Filter by extension .csv
                    name = Path(member.filename).name
                    if name.lower().endswith(".csv"):
                        with zf.open(member) as src, open(data_dir / name, "wb") as dst:
                            shutil.copyfileobj(src, dst)
                        csv_found.append(name)
        except Exception as e:
            print("Skip bad zip:", z, "->", e)

# 5) Verify result and show summary
if not csv_found:
    raise FileNotFoundError("No .csv found.")
print("Path to dataset files:", data_dir.resolve(), ", the dataset file name is:", csv_found)
KaggleHub cache: C:\Users\yujua\.cache\kagglehub\datasets\kashishparmar02\social-media-sentiments-analysis-dataset\versions\3
Path to dataset files: C:\Users\yujua\Desktop\F25\DataScience_BootCamp\SenseSmart\data , the dataset file name is: ['sentimentdataset.csv']

Load Data & Column Standardization¶

  • Text — the post text
  • Sentiment — emotion label (e.g., Positive, Negative, Neutral, ...)
  • Timestamp — time the post was made
  • Platform — social platform name
  • Likes — number of likes
  • Retweets — number of retweets
  • Country — country string
  • Hashtags — raw hashtag text
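Timestamp arrives as a string, and the count columns can be read with mixed dtypes. A small sketch (toy values are hypothetical) of how these columns can be normalized with pandas before the temporal and engagement analyses:

```python
import pandas as pd

# Hypothetical single-row frame mirroring the columns listed above.
sample = pd.DataFrame({
    'Text': ['Enjoying a beautiful day at the park!'],
    'Timestamp': ['2023-01-15 12:30:00'],
    'Likes': ['30'],
    'Retweets': ['15'],
})

# Parse Timestamp into datetime64 and coerce counts to numeric;
# unparseable values become NaT/NaN instead of raising.
sample['Timestamp'] = pd.to_datetime(sample['Timestamp'], errors='coerce')
for col in ('Likes', 'Retweets'):
    sample[col] = pd.to_numeric(sample[col], errors='coerce')

print(sample.dtypes)
```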

In [5]:
# Data directory ./data
DATA_DIR = Path("data")

# Pick the dataset file in ./data
csv_files = sorted(DATA_DIR.glob("*.csv"))
if not csv_files:
    raise FileNotFoundError("No data file found in ./data.")
DATA_FILE = csv_files[0]
print("Selected file:", DATA_FILE.name)

# Read the CSV
df = pd.read_csv(DATA_FILE, low_memory=False)
print("Raw shape:", df.shape)
print("Raw columns:", list(df.columns))

# Normalize original column names to lowercase + strip for matching
# df = df.rename(columns={c: str(c).lower().strip() for c in df.columns})

# Drop index-duplicates: Unnamed
unnamed_cols = [c for c in df.columns if re.match(r"^Unnamed", str(c), flags=re.IGNORECASE)]
if unnamed_cols:
    df = df.drop(columns=unnamed_cols)
    print("Dropped Unnamed columns:", unnamed_cols)

# Preview
desired_order = ["Text", "Sentiment", "Timestamp", "User", 
                 "Platform", "Hashtags", "Retweets", "Likes", 
                 "Country", "Year", "Month", "Day", "Hour"
]

if all(col in df.columns for col in desired_order):
    view = df[desired_order]
else:
    present = [c for c in desired_order if c in df.columns]
    others  = [c for c in df.columns if c not in present]
    view = df[present + others]
print("\nPreview:")
display(view.head())

print("Final shape:", view.shape)
print("Final columns:", list(view.columns))
Selected file: sentimentdataset.csv
Raw shape: (732, 15)
Raw columns: ['Unnamed: 0.1', 'Unnamed: 0', 'Text', 'Sentiment', 'Timestamp', 'User', 'Platform', 'Hashtags', 'Retweets', 'Likes', 'Country', 'Year', 'Month', 'Day', 'Hour']
Dropped Unnamed columns: ['Unnamed: 0.1', 'Unnamed: 0']

Preview:
Text Sentiment Timestamp User Platform Hashtags Retweets Likes Country Year Month Day Hour
0 Enjoying a beautiful day at the park! ... Positive 2023-01-15 12:30:00 User123 Twitter #Nature #Park 15.0 30.0 USA 2023 1 15 12
1 Traffic was terrible this morning. ... Negative 2023-01-15 08:45:00 CommuterX Twitter #Traffic #Morning 5.0 10.0 Canada 2023 1 15 8
2 Just finished an amazing workout! 💪 ... Positive 2023-01-15 15:45:00 FitnessFan Instagram #Fitness #Workout 20.0 40.0 USA 2023 1 15 15
3 Excited about the upcoming weekend getaway! ... Positive 2023-01-15 18:20:00 AdventureX Facebook #Travel #Adventure 8.0 15.0 UK 2023 1 15 18
4 Trying out a new recipe for dinner tonight. ... Neutral 2023-01-15 19:55:00 ChefCook Instagram #Cooking #Food 12.0 25.0 Australia 2023 1 15 19
Final shape: (732, 13)
Final columns: ['Text', 'Sentiment', 'Timestamp', 'User', 'Platform', 'Hashtags', 'Retweets', 'Likes', 'Country', 'Year', 'Month', 'Day', 'Hour']

Platform Sentiment Bias Analysis¶

In [6]:
pd.set_option('display.show_dimensions', True)  # Always show the "rows × columns" footer (default 'truncate' shows it only for truncated frames)

# 1) Basic validation: ensure the 'Platform' and 'Sentiment' columns exist
required_cols = {'Platform', 'Sentiment'}
missing = required_cols - set(df.columns)
assert not missing, f"Missing required columns: {missing}"

# Copy and cleaning
df = df.copy()
df['Platform'] = df['Platform'].astype(str).str.strip()
df['Sentiment'] = df['Sentiment'].astype(str).str.strip()

# Drop rows with blank critical fields
df = df[(df['Platform'] != '') & (df['Sentiment'] != '')].reset_index(drop=True)

# Normalize platform names ("twitter"->"Twitter")
df['Platform'] = df['Platform'].str.lower().str.capitalize()

# 2) Sentiment mapping via VADER
# compound ∈ [-1, 1]: > 0.05 → Positive; < -0.05 → Negative; otherwise Neutral (VADER's suggested thresholds)
import nltk
nltk.download('vader_lexicon', quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer
_vader = SentimentIntensityAnalyzer()

def map_to_polarity(text: str, pos_th: float = 0.05, neg_th: float = -0.05) -> str:
    # Map phrases to 3 classes using VADER compound score: Positive / Negative / Neutral
    if not isinstance(text, str) or text.strip() == '':
        return 'Neutral'  # Empty is Neutral
    comp = _vader.polarity_scores(text)['compound']
    if comp > pos_th:
        return 'Positive'
    elif comp < neg_th:
        return 'Negative'
    else:
        return 'Neutral'

df['Sentiment_Category'] = df['Sentiment'].apply(map_to_polarity)

print("Mapping preview:")
display(df[['Sentiment', 'Sentiment_Category']].head(100))

# Inspect which original sentiment terms landed in each final class
def _clean_term(s: str) -> str:
    return str(s).strip()

pos_terms_vc = (
    df.loc[df['Sentiment_Category'] == 'Positive', 'Sentiment']
      .map(_clean_term)
      .value_counts()
)
neg_terms_vc = (
    df.loc[df['Sentiment_Category'] == 'Negative', 'Sentiment']
      .map(_clean_term)
      .value_counts()
)
neu_terms_vc = (
    df.loc[df['Sentiment_Category'] == 'Neutral', 'Sentiment']
      .map(_clean_term)
      .value_counts()
)

print("\n🟢 Positive terms")
display(pos_terms_vc.to_frame('count').reset_index())

print("\n🔴 Negative terms")
display(neg_terms_vc.to_frame('count').reset_index())

print("\n⚪ Neutral terms")
display(neu_terms_vc.to_frame('count').reset_index())

# Export complete file
pos_terms_vc.to_csv("classified_positive_terms.csv")
neg_terms_vc.to_csv("classified_negative_terms.csv")
neu_terms_vc.to_csv("classified_neutral_terms.csv")
print("Saved: classified_positive_terms.csv / classified_negative_terms.csv / classified_neutral_terms.csv")
Mapping preview:
Sentiment Sentiment_Category
0 Positive Positive
1 Negative Negative
2 Positive Positive
3 Positive Positive
4 Neutral Neutral
... ... ...
95 Confusion Negative
96 Excitement Positive
97 Kind Positive
98 Pride Positive
99 Shame Negative

100 rows × 2 columns

🟢 Positive terms
Sentiment count
0 Positive 45
1 Joy 44
2 Excitement 37
3 Contentment 19
4 Gratitude 18
... ... ...
75 Thrilling Journey 1
76 Creative Inspiration 1
77 Runway Creativity 1
78 Ocean's Freedom 1
79 Relief 1

80 rows × 2 columns

🔴 Negative terms
Sentiment count
0 Despair 11
1 Grief 9
2 Loneliness 9
3 Sad 9
4 Embarrassed 8
5 Confusion 8
6 Frustration 6
7 Melancholy 6
8 Regret 6
9 Indifference 6
10 Hate 6
11 Bad 6
12 Numbness 6
13 Disgust 5
14 Bitterness 5
15 Frustrated 5
16 Betrayal 5
17 Negative 4
18 Boredom 4
19 Heartbreak 3
20 Jealousy 3
21 Resentment 3
22 Shame 3
23 Bitter 3
24 Devastated 3
25 Envious 3
26 Fearful 3
27 Jealous 3
28 Sadness 2
29 Fear 2
30 Anger 2
31 Disappointed 2
32 Loss 2
33 Helplessness 2
34 Intimidation 2
35 Anxiety 2
36 Envy 2
37 Isolation 2
38 Disappointment 2
39 Sorrow 2
40 Bittersweet 1
41 Darkness 1
42 Exhaustion 1
43 Suffering 1
44 Desperation 1
45 Pressure 1
46 Ruins 1
47 Obstacle 1

48 rows × 2 columns

⚪ Neutral terms
Sentiment count
0 Neutral 18
1 Curiosity 16
2 Serenity 15
3 Nostalgia 11
4 Awe 9
... ... ...
58 Imagination 1
59 Mesmerizing 1
60 Winter Magic 1
61 Celestial Wonder 1
62 Whispers of the Past 1

63 rows × 2 columns

Saved: classified_positive_terms.csv / classified_negative_terms.csv / classified_neutral_terms.csv
In [7]:
# 3) Aggregate counts & ratios
# Group by Platform and Sentiment_Category to count posts per sentiment
count_tbl = (
    df.groupby(['Platform', 'Sentiment_Category'], observed=True) # observed=True to avoid a FutureWarning
      .size()
      .unstack(fill_value=0) # Pivot Sentiment_Category into Positive/Negative/Neutral columns, filling missing combinations with 0
)

# Ensure all 3 sentiment columns exist
for col in ['Positive', 'Negative', 'Neutral']:
    if col not in count_tbl.columns:
        count_tbl[col] = 0

# Totals and ratios
count_tbl['Total'] = count_tbl[['Positive', 'Negative', 'Neutral']].sum(axis=1)
safe_total = count_tbl['Total'].replace(0, np.nan)  # avoid division by zero
count_tbl['Positive_Ratio'] = count_tbl['Positive'] / safe_total
count_tbl['Negative_Ratio'] = count_tbl['Negative'] / safe_total
count_tbl['Neutral_Ratio']  = count_tbl['Neutral']  / safe_total

# 4) Correctness checks
# Check that counts add up to Total
_num_ok = (count_tbl['Positive'] + count_tbl['Negative'] + count_tbl['Neutral'] == count_tbl['Total']).all()
assert _num_ok, "Counts don't sum to Total."

# Check that ratios add up to 1
ratio_sum = (count_tbl['Positive_Ratio'] + count_tbl['Negative_Ratio'] + count_tbl['Neutral_Ratio'])
ratio_ok = np.allclose(ratio_sum.dropna().values, np.ones(ratio_sum.dropna().shape[0]), rtol=1e-6, atol=1e-6)
assert ratio_ok, "Ratios don't sum to 1."

print("📊 Counts & Ratios by Platform")
display(count_tbl.sort_index())

# Prepare a reset_index version for plotting
count_reset = count_tbl.reset_index()

# Make sure the platform column is called 'Platform'
if 'Platform' not in count_reset.columns:
    count_reset = count_reset.rename(columns={count_reset.columns[0]: 'Platform'})

# 5) Visualization
# Stacked bar for composition
fig_counts = px.bar(
    count_reset,
    x='Platform',
    y=['Positive', 'Negative', 'Neutral'],
    title="Sentiment Distribution by Platform (Counts)",
    labels={
        "value": "Post Count",
        "variable": "Sentiment",
        "Platform": "Platform"
    },
    color_discrete_sequence=px.colors.qualitative.Set2,
    template='plotly_dark'
)

fig_counts.update_layout(
    barmode='stack',
    width=900,
    height=500,
)

fig_counts.show()

# Grouped Bar: Ratios
ratio_plot = count_reset[['Platform', 'Positive_Ratio', 'Negative_Ratio', 'Neutral_Ratio']].copy()

# Melt into long form: Platform, Sentiment, Ratio
ratio_long = ratio_plot.melt(
    id_vars='Platform',
    value_vars=['Positive_Ratio', 'Negative_Ratio', 'Neutral_Ratio'],
    var_name='Sentiment',
    value_name='Ratio'
)

# Clean sentiment names for legend (drop "_Ratio")
ratio_long['Sentiment'] = ratio_long['Sentiment'].str.replace('_Ratio', '', regex=False)

fig_ratio = px.bar(
    ratio_long,
    x='Platform',
    y='Ratio',
    color='Sentiment',
    text='Ratio',
    barmode='group',
    title="Platform Sentiment Bias (Ratios)",
    labels={
        "Platform": "Platform",
        "Ratio": "Ratio",
        "Sentiment": "Sentiment"
    },
    template='plotly_dark',
    color_discrete_sequence=px.colors.qualitative.Set2
)

# Show text as percentage
fig_ratio.update_traces(
    texttemplate='%{text:.1%}',
    textposition='outside'
)

ymax = ratio_long['Ratio'].max() * 1.2
fig_ratio.update_yaxes(
    range=[0, ymax],
    tickformat=".0%"
)

fig_ratio.update_layout(
    width=900,
    height=500,
)
fig_ratio.show()

# 6) Text Summary
print("📑 Summary")
for plat, row in count_tbl.iterrows():
    pos = float(row['Positive_Ratio']) if not math.isnan(row['Positive_Ratio']) else 0.0
    neg = float(row['Negative_Ratio']) if not math.isnan(row['Negative_Ratio']) else 0.0
    neu = float(row['Neutral_Ratio'])  if not math.isnan(row['Neutral_Ratio'])  else 0.0
    
    if pos > max(neg, neu):
        bias = "🟢 Positive-leaning"
    elif neg > max(pos, neu):
        bias = "🔴 Negative-leaning"
    elif neu > max(pos, neg):
        bias = "⚪ Neutral-leaning"
    else:
        bias = "⚖️ No clear bias"
        
    print(f"{plat}: Positive {pos:.1%}, Negative {neg:.1%}, Neutral {neu:.1%} → {bias}")
📊 Counts & Ratios by Platform
Sentiment_Category Negative Neutral Positive Total Positive_Ratio Negative_Ratio Neutral_Ratio
Platform
Facebook 56 50 125 231 0.541126 0.242424 0.216450
Instagram 62 62 134 258 0.519380 0.240310 0.240310
Twitter 65 59 119 243 0.489712 0.267490 0.242798

3 rows × 7 columns

📑 Summary
Facebook: Positive 54.1%, Negative 24.2%, Neutral 21.6% → 🟢 Positive-leaning
Instagram: Positive 51.9%, Negative 24.0%, Neutral 24.0% → 🟢 Positive-leaning
Twitter: Positive 49.0%, Negative 26.7%, Neutral 24.3% → 🟢 Positive-leaning

Sentiment vs Engagement (Retweets & Likes)¶

In [8]:
# Use VADER-based sentiment classification: Positive / Negative / Neutral

# Clean the sentiment category column
df['Sentiment_Category'] = df['Sentiment_Category'].astype(str).str.strip()

# Keep only the 3 VADER sentiment categories
sent_order = ['Positive', 'Negative', 'Neutral']
sent3_df = df[df['Sentiment_Category'].isin(sent_order)].copy()
sent3_df['Sentiment_3'] = pd.Categorical(
    sent3_df['Sentiment_Category'],
    categories=sent_order,
    ordered=True
)

print("Number of valid rows after VADER 3-way classification:", len(sent3_df))
print("Rows per sentiment:")
print(sent3_df['Sentiment_3'].value_counts())
Number of valid rows after VADER 3-way classification: 732
Rows per sentiment:
Sentiment_3
Positive    378
Negative    183
Neutral     171
Name: count, Length: 3, dtype: int64
In [9]:
# Calculate the mean/median/count of Retweets and Likes
sent_engagement = (
    sent3_df
        .groupby('Sentiment_3', sort=False, observed=True)[['Retweets', 'Likes']] # Group by sentiment to aggregate Retweets and Likes
        .agg(['mean', 'median', 'count'])
        .reindex(sent_order)
        .round(2)
)

print("Engagement by Sentiment (Mean / Median / Count)")
display(sent_engagement)
Engagement by Sentiment (Mean / Median / Count)
Retweets Likes
mean median count mean median count
Sentiment_3
Positive 22.89 22.0 378 45.64 45.0 378
Negative 17.35 18.0 183 34.63 35.0 183
Neutral 22.89 22.0 171 45.69 45.0 171

3 rows × 6 columns

In [10]:
# Compute mean Retweets & Likes per sentiment
plot_data = (
    sent3_df
        .groupby('Sentiment_3', sort=False, observed=True)[['Retweets', 'Likes']]
        .mean()
        .reindex(sent_order)
        .reset_index()
)
display(plot_data)

# Average Retweets
fig_ret = px.bar(
    plot_data,
    x='Sentiment_3',
    y='Retweets',
    color='Retweets',
    color_continuous_scale='Viridis',
    text='Retweets',
    title="Average Retweets by Sentiment",
    labels={
        'Sentiment_3': 'Sentiment',
        'Retweets': 'Average Retweets'
    },
    template='plotly_dark'
)

fig_ret.update_traces(
    texttemplate='%{text:.1f}',
    textposition='outside'
)
fig_ret.update_yaxes(range=[0, plot_data['Retweets'].max() * 1.2])
fig_ret.update_layout(width=900, height=500)
fig_ret.show()

# Average Likes
fig_like = px.bar(
    plot_data,
    x='Sentiment_3',
    y='Likes',
    color='Likes',
    color_continuous_scale='Viridis',
    text='Likes',
    title="Average Likes by Sentiment",
    labels={
        'Sentiment_3': 'Sentiment',
        'Likes': 'Average Likes'
    },
    template='plotly_dark'
)

fig_like.update_traces(
    texttemplate='%{text:.1f}',
    textposition='outside'
)
fig_like.update_yaxes(range=[0, plot_data['Likes'].max() * 1.2])
fig_like.update_layout(width=900, height=500)
fig_like.show()
Sentiment_3 Retweets Likes
0 Positive 22.894180 45.642857
1 Negative 17.349727 34.633880
2 Neutral 22.894737 45.690058

3 rows × 3 columns

In [11]:
# Boxplots: distribution of Retweets & Likes by Sentiment

# Retweets boxplot
fig_ret_box = px.box(
    sent3_df,
    x='Sentiment_3',
    y='Retweets',
    color='Sentiment_3',
    category_orders={'Sentiment_3': sent_order},
    title="Retweets Distribution by Sentiment",
    labels={
        'Sentiment_3': 'Sentiment',
        'Retweets': 'Retweets'
    },
    template='plotly_dark',
    color_discrete_sequence=px.colors.qualitative.Set2
)
fig_ret_box.update_layout(
    width=900,
    height=500
)
fig_ret_box.show()

# Likes boxplot
fig_like_box = px.box(
    sent3_df,
    x='Sentiment_3',
    y='Likes',
    color='Sentiment_3',
    category_orders={'Sentiment_3': sent_order},  # keep category order consistent
    title="Likes Distribution by Sentiment",
    labels={
        'Sentiment_3': 'Sentiment',
        'Likes': 'Likes'
    },
    template='plotly_dark',
    color_discrete_sequence=px.colors.qualitative.Set2
)
fig_like_box.update_layout(
    width=900,
    height=500
)
fig_like_box.show()
In [12]:
# Detailed statistics table for each sentiment, used to assist in interpreting the boxplots
stats = []

for sentiment in sent_order:
    # Extract values for each sentiment
    ret_values = sent3_df[sent3_df['Sentiment_3'] == sentiment]['Retweets']
    like_values = sent3_df[sent3_df['Sentiment_3'] == sentiment]['Likes']

    stats.append({
        "Sentiment": sentiment,

        # Retweets
        "Retweets_Q1": np.percentile(ret_values, 25),
        "Retweets_Median": np.median(ret_values),
        "Retweets_Q3": np.percentile(ret_values, 75),
        "Retweets_Min": ret_values.min(),
        "Retweets_Max": ret_values.max(),
        "Retweets_SampleSize": len(ret_values),

        # Likes
        "Likes_Q1": np.percentile(like_values, 25),
        "Likes_Median": np.median(like_values),
        "Likes_Q3": np.percentile(like_values, 75),
        "Likes_Min": like_values.min(),
        "Likes_Max": like_values.max(),
        "Likes_SampleSize": len(like_values)
    })

stats_df = pd.DataFrame(stats).round(2)

print("Detailed Statistical Summary for Sentiment Categories")
display(stats_df)
Detailed Statistical Summary for Sentiment Categories
Sentiment Retweets_Q1 Retweets_Median Retweets_Q3 Retweets_Min Retweets_Max Retweets_SampleSize Likes_Q1 Likes_Median Likes_Q3 Likes_Min Likes_Max Likes_SampleSize
0 Positive 18.0 22.0 28.0 8.0 40.0 378 35.0 45.0 55.0 15.0 80.0 378
1 Negative 12.0 18.0 22.0 5.0 40.0 183 25.0 35.0 45.0 10.0 80.0 183
2 Neutral 18.0 22.0 28.0 10.0 40.0 171 35.0 45.0 55.0 20.0 80.0 171

3 rows × 13 columns

Word Cloud¶

In [13]:
# Dependencies
pkgs = ["wordcloud", "jieba", "pillow", "matplotlib", "numpy", "pandas"]
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])
print("Installed:", ", ".join(pkgs))
Installed: wordcloud, jieba, pillow, matplotlib, numpy, pandas
In [14]:
# Configurable Parameters
OUT_DIR = "outputs" # Output folder for generated PNGs and top-token CSVs
CSV_FALLBACK_PATH = "/mnt/data/sentimentdataset.csv" # If no global df is found, fall back to reading this CSV
TEXT_COL_HINT = "text" # Preferred name of the text column (case-insensitive)
GROUP_BY_COL = None # If set (e.g., "Platform"), build one cloud per group; else a single cloud.
MASK_PATH = None # Optional mask image path (white=allowed area)
RAND_SEED = 42 # Random seed for reproducibility
IMG_SIZE = (2400, 1400) # Output resolution in pixels: width x height
MAX_WORDS = 500 # Maximum number of words to display
BACKGROUND_COLOR = "white" # Background color (e.g., "black").
RELATIVE_SCALING = 0.5 # Strength of frequency-to-font-size mapping (0~1)
PREFER_HORIZONTAL = 0.9 # Proportion of words drawn horizontally (0~1)
COLORMAP_NAME = None # If using Matplotlib colormap (e.g., "tab20"), set COLOR_FUNC=None and set this

random.seed(RAND_SEED); np.random.seed(RAND_SEED)

# Pre-configure fonts to minimize missing glyph warnings for CJK
rcParams["font.sans-serif"] = ["SimHei","Microsoft YaHei","Arial Unicode MS","DejaVu Sans"]
rcParams["axes.unicode_minus"] = False

# Text cleaning & tokenization
URL  = re.compile(r"https?://\S+|www\.\S+", re.I) # strip URLs
AT   = re.compile(r"@[A-Za-z0-9_]+") # strip @mentions
HASH = re.compile(r"#") # remove '#'
HTML = re.compile(r"&[A-Za-z]+;") # handle HTML entities
CN_CHAR = re.compile(r"[\u4e00-\u9fff]") # detect CJK chars
EN_ONLY = re.compile(r"[^a-z']+") # keep [a-z'] only

def clean_line(s: str) -> str:
    # Basic cleaning: lowercase, strip URL/@/HTML/entities, collapse spaces
    s = str(s).lower()
    s = URL.sub(" ", s); s = AT.sub(" ", s); s = HASH.sub("", s); s = HTML.sub(" ", s)
    return re.sub(r"\s+", " ", s).strip()

def tokenize_mixed(text: str):
    if CN_CHAR.search(text):
        return [t.strip() for t in jieba.cut(text) if t.strip()]
    t = EN_ONLY.sub(" ", text)
    return [w for w in t.split() if w]

# Stopwords
EN_STOP = set(STOPWORDS) | {
    "rt","amp","im","ive","dont","didnt","doesnt","cant","couldnt","isnt","wasnt",
    "arent","werent","youre","youve","youll","theyre","weve","well","hes","shes",
    "thats","theres","whats","a","an","the","is","are","was","were","be","been","being",
    "i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself","yourselves",
    "he","him","his","himself","she","her","hers","herself","it","its","itself","they","them","their","theirs","themselves",
    "this","that","these","those","and","but","if","or","because","as","until","while",
    "of","at","by","for","with","about","against","between","into","through","during","before","after",
    "above","below","to","from","up","down","in","out","on","off","over","under",
    "again","further","then","once","here","there","when","where","why","how","all","any","both","each",
    "few","more","most","other","some","such","no","nor","not","only","own","same","so","than","too","very",
    "s","t","can","will","just","don","should","now"
}

STOP = EN_STOP

# Optional mask loader (white pixels = allowed region for words).
def load_mask(mask_path):
    if not mask_path:
        return None
    img = Image.open(mask_path).convert("L") # convert to grayscale
    arr = np.array(img)
    # Treat bright area as 255 (allowed), others 0
    return np.where(arr > 200, 255, 0).astype(np.uint8)

MASK_ARRAY = load_mask(MASK_PATH)

# Resolve the text column in a case-insensitive way
def get_text_series(df, col_hint="text"):
    cmap = {c.lower().strip(): c for c in df.columns}
    if col_hint.lower() in cmap:
        col = cmap[col_hint.lower()]
    else:
        cands = [c for c in df.columns if "text" in c.lower()]
        if not cands:
            raise KeyError("Text column not found")
        col = cands[0]
    return df[col].astype(str).fillna("")

# Color strategy (two options)
def color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    # Soft random HLS colors within readable saturation/lightness ranges
    h = random.randint(0, 359) / 360.0
    s = random.randint(55, 85) / 100.0
    l = random.randint(35, 60) / 100.0
    r, g, b = colorsys.hls_to_rgb(h, l, s)
    return (int(r*255), int(g*255), int(b*255))

# Alternatively, set COLORMAP_NAME to a Matplotlib colormap (e.g., "tab20") and leave COLOR_FUNC as None
COLOR_FUNC = color_func if COLORMAP_NAME is None else None

# Main entry: build a word cloud from a pandas Series of text and export PNG
def build_wordcloud_from_series(text_series: pd.Series, out_prefix: str):
    cleaned = text_series.map(clean_line)

    tokens = []
    for line in cleaned:
        for t in tokenize_mixed(line):
            if t not in STOP and len(t) > 1:
                tokens.append(t)

    if not tokens:
        raise ValueError("No valid terms detected after preprocessing. Please check input or adjust the stopword settings.")

    freq = Counter(tokens) # Frequency count

    # Persist top-N token frequencies for audit/plots
    os.makedirs(OUT_DIR, exist_ok=True)
    pd.DataFrame(freq.most_common(300), columns=["token","count"])\
      .to_csv(os.path.join(OUT_DIR, f"{out_prefix}_top_tokens.csv"), index=False)

    wc = WordCloud(
        width=IMG_SIZE[0], height=IMG_SIZE[1],
        background_color=BACKGROUND_COLOR,
        max_words=MAX_WORDS,
        prefer_horizontal=PREFER_HORIZONTAL,
        relative_scaling=RELATIVE_SCALING,
        mask=MASK_ARRAY,
        colormap=COLORMAP_NAME        # If COLOR_FUNC is set, recolor below overrides this
    ).generate_from_frequencies(freq)

    # Optional recolor with custom function
    if COLOR_FUNC is not None:
        wc = wc.recolor(color_func=COLOR_FUNC, random_state=RAND_SEED)

    # Export the PNG
    out_png = os.path.join(OUT_DIR, f"{out_prefix}.png")
    wc.to_file(out_png)

    # Preview in notebooks
    plt.figure(figsize=(12, 7))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()

    print(f"✅ Saved image: {out_png}")
    return out_png

# Read df if not provided externally
try:
    _ = df
except NameError:
    if not os.path.exists(CSV_FALLBACK_PATH):
        raise FileNotFoundError("No DataFrame detected, and the CSV file could not be found. Please load the DataFrame or update the CSV_FALLBACK_PATH.")
    df = pd.read_csv(CSV_FALLBACK_PATH, low_memory=False)

# Run: one overall cloud or per-group multiple clouds
if GROUP_BY_COL:
    # Group by the column and build one cloud per group
    for key, g in df.groupby(GROUP_BY_COL):
        try:
            text_s = get_text_series(g, TEXT_COL_HINT)
            build_wordcloud_from_series(text_s, out_prefix=f"wordcloud_{GROUP_BY_COL}_{key}")
        except Exception as e:
            # Some groups may be empty after cleaning; safely skip
            print(f"skip {GROUP_BY_COL}={key}: {e}")
else:
    text_s = get_text_series(df, TEXT_COL_HINT)
    build_wordcloud_from_series(text_s, out_prefix="wordcloud_text")
✅ Saved image: outputs\wordcloud_text.png

Most Active Country¶

In [15]:
# Clean the Country column
df['Country_clean'] = df['Country'].astype(str).str.strip()

# Group by Country_clean to count number of posts per country
country_activity = (
    df.groupby('Country_clean')
      .size()
      .reset_index(name='Num_Posts') # Number of posts per country
      .sort_values(by='Num_Posts', ascending=False) # Sort by number of posts from highest to lowest
)
print("Top 5 most active countries by number of posts:")
display(country_activity.head(5))

# Get the most active country for later analysis
top_country = country_activity.iloc[0]['Country_clean']
print("\nMost active country:", top_country)
Top 5 most active countries by number of posts:
   Country_clean  Num_Posts
32           USA        188
31            UK        143
5         Canada        135
0      Australia         75
13         India         70

5 rows × 2 columns

Most active country: USA
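The `groupby(...).size()` pattern above and `value_counts()` (used in the next cell) yield the same counts once the country strings are stripped; a small sketch with toy data confirming the equivalence:

```python
import pandas as pd

demo = pd.DataFrame({"Country": [" USA", "USA ", "USA", "UK", "UK", "Canada"]})
demo["Country_clean"] = demo["Country"].astype(str).str.strip()

by_group = demo.groupby("Country_clean").size()   # Series indexed by country
by_counts = demo["Country_clean"].value_counts()  # same counts, sorted descending

assert by_group.to_dict() == by_counts.to_dict()
assert by_counts.index[0] == "USA"  # most active country in this toy data
```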
In [16]:
# Top 10 Active Countries (Bar Plot)
# Count posts per cleaned country name
posts_per_country = (
    df['Country_clean']
      .value_counts()
      .reset_index()
)
posts_per_country.columns = ['Country', 'Posts']

# Select the top 10 countries by post count
top10_countries = posts_per_country.head(10)

# Use Plotly
fig = px.bar(
    top10_countries,
    x='Country',                 
    y='Posts',                   
    text='Posts',                
    color='Posts',               
    color_continuous_scale='Viridis',
    title='Top 10 Active Countries',
    template='plotly_dark'
)

# Place the numeric labels outside the bars
fig.update_traces(textposition='outside')

# Increase y-axis max to prevent text cutoff
fig.update_yaxes(range=[0, top10_countries['Posts'].max() * 1.2])
fig.update_layout(
    height=500,
    width=900,
    xaxis_title="Country",
    yaxis_title="Number of Posts"
)
fig.show()

Peak Activity by Hour in the Most Active Country¶

In [17]:
# Filter rows for the most active country
df_top_country = df[df['Country_clean'] == top_country].copy()

# Count number of posts per hour
hour_activity = (
    df_top_country.groupby('Hour')
                  .size()
                  .reset_index(name='Posts')
                  .sort_values(by='Posts', ascending=False)
)

print("Top posting hours for:", top_country)
display(hour_activity.head(5))

# Plotly
fig = px.bar(
    hour_activity.sort_values('Hour'), # Sort hours ascending
    x='Hour',
    y='Posts',
    text='Posts', 
    color='Posts',
    color_continuous_scale='Viridis',
    title=f"Peak Activity by Hour in {top_country}",
    template='plotly_dark'
)

# Position text labels above bars
fig.update_traces(textposition='outside')

# Add headroom to avoid clipping text labels
fig.update_yaxes(range=[0, hour_activity['Posts'].max() * 1.2])
fig.update_layout(
    height=500,
    width=900,
    xaxis_title="Hour",
    yaxis_title="Number of Posts"
)
fig.show()
Top posting hours for: USA
    Hour  Posts
8     14     26
10    16     23
9     15     21
14    20     15
15    21     14

5 rows × 2 columns
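The `Hour` column used in this cell is assumed to have been derived from the `Timestamp` column earlier in the notebook; a minimal sketch of that derivation with `pd.to_datetime`:

```python
import pandas as pd

ts = pd.Series(["2023-01-15 14:30:00", "2023-06-02 08:05:00"])
hours = pd.to_datetime(ts).dt.hour
print(hours.tolist())  # [14, 8]
```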

Peak Activity Months in the Most Active Country¶

In [18]:
# df_top_country was created in the previous cell:
# df_top_country = df[df['Country_clean'] == top_country].copy()

# Count number of posts per month
month_activity_raw = (
    df_top_country.groupby('Month')
                  .size()
                  .reset_index(name='Posts')
)

print("Raw monthly activity (months that appeared in data):")
display(month_activity_raw)
all_months = pd.DataFrame({'Month': list(range(1, 13))})
month_activity = all_months.merge(month_activity_raw, on='Month', how='left')
month_activity['Posts'] = month_activity['Posts'].fillna(0).astype(int)

print("\nFull 12-month activity (1–12 months):")
display(month_activity)

month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_activity["Month_Name"] = month_activity["Month"].apply(lambda x: month_names[x - 1])

# Plotly
fig = px.bar(
    month_activity,
    x='Month_Name',
    y='Posts',
    text='Posts',
    color='Posts',
    color_continuous_scale='Viridis',
    title=f"Peak Activity Months in {top_country}",
    template='plotly_dark'
)

fig.update_traces(textposition='outside')
fig.update_yaxes(range=[0, month_activity['Posts'].max() * 1.2])
fig.update_layout(
    height=500,
    width=900,
    xaxis_title="Month",
    yaxis_title="Number of Posts"
)
fig.show()
Raw monthly activity (months that appeared in data):
    Month  Posts
0       1     27
1       2     24
2       3     10
3       4     10
4       5     10
5       6     21
6       7     15
7       8     21
8       9     21
9      10     11
10     11      8
11     12     10

12 rows × 2 columns

Full 12-month activity (1–12 months):
    Month  Posts
0       1     27
1       2     24
2       3     10
3       4     10
4       5     10
5       6     21
6       7     15
7       8     21
8       9     21
9      10     11
10     11      8
11     12     10

12 rows × 2 columns
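The left-merge-plus-`fillna` pattern above guarantees all 12 months appear even when some have no posts; `reindex` on a month-indexed Series is an equivalent one-liner. A sketch with toy data where only months 1 and 3 have posts:

```python
import pandas as pd

raw = pd.DataFrame({"Month": [1, 3], "Posts": [5, 2]})

# Pattern used in the notebook: merge against a full month frame, then fill
all_months = pd.DataFrame({"Month": range(1, 13)})
full = all_months.merge(raw, on="Month", how="left")
full["Posts"] = full["Posts"].fillna(0).astype(int)

# Equivalent: reindex a month-indexed Series over 1..12
alt = raw.set_index("Month")["Posts"].reindex(range(1, 13), fill_value=0)

assert full["Posts"].tolist() == alt.tolist() == [5, 0, 2] + [0] * 9
```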

Posting Frequency by Hour¶

In [19]:
# Count number of posts per hour (0–23)
hour_counts_raw = (
    df.groupby('Hour')
      .size()
      .reset_index(name='Posts')
)

all_hours = pd.DataFrame({'Hour': list(range(24))})
hour_counts = all_hours.merge(hour_counts_raw, on='Hour', how='left')
hour_counts['Posts'] = hour_counts['Posts'].fillna(0).astype(int)

print("Hourly posting frequency (0–23):")
display(hour_counts)

# Plotly
fig = px.bar(
    hour_counts,
    x='Hour',
    y='Posts',
    text='Posts',
    color='Posts',
    color_continuous_scale='Viridis',
    title="Posting Frequency by Hour",
    template='plotly_dark'
)

fig.update_traces(textposition='outside')
fig.update_yaxes(range=[0, hour_counts['Posts'].max() * 1.2])
fig.update_layout(
    height=500,
    width=900,
    xaxis_title="Hour of Day",
    yaxis_title="Number of Posts"
)
fig.show()
Hourly posting frequency (0–23):
    Hour  Posts
0      0      1
1      1      0
2      2      1
3      3      3
4      4      0
5      5      1
6      6      4
7      7      7
8      8     23
9      9     28
10    10     30
11    11     37
12    12     38
13    13     30
14    14     94
15    15     47
16    16     69
17    17     48
18    18     65
19    19     75
20    20     50
21    21     41
22    22     33
23    23      7

24 rows × 2 columns

Engagement Heatmap by Hour and Platform¶

In [20]:
# Define Engagement metric (Retweets + Likes)

# Make sure Retweets / Likes have no NaN to avoid issues when adding
df['Retweets'] = df['Retweets'].fillna(0)
df['Likes'] = df['Likes'].fillna(0)

# Define Engagement as the sum of Retweets and Likes
df['Engagement'] = df['Retweets'] + df['Likes']

print("Engagement column created. Preview:")
df[['Platform', 'Hour', 'Retweets', 'Likes', 'Engagement']].head()
Engagement column created. Preview:
Out[20]:
    Platform  Hour  Retweets  Likes  Engagement
0    Twitter    12      15.0   30.0        45.0
1    Twitter     8       5.0   10.0        15.0
2  Instagram    15      20.0   40.0        60.0
3   Facebook    18       8.0   15.0        23.0
4  Instagram    19      12.0   25.0        37.0

5 rows × 5 columns
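The `fillna(0)` above matters because NaN propagates through addition: a row missing either metric would otherwise get a NaN engagement score. A tiny sketch:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({"Retweets": [5.0, np.nan], "Likes": [10.0, 3.0]})

naive = d["Retweets"] + d["Likes"]                     # NaN + 3.0 -> NaN
safe = d["Retweets"].fillna(0) + d["Likes"].fillna(0)  # NaN treated as 0

assert pd.isna(naive.iloc[1])
assert safe.tolist() == [15.0, 3.0]
```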

In [21]:
# Engagement Heatmap by Hour and Platform
heat_df = (
    df.groupby(['Platform', 'Hour'], as_index=False)['Engagement']
      .sum()
)

# Pivot to matrix: rows = Platform, columns = Hour, values = Engagement
heat_pivot = (
    heat_df
    .pivot(index='Platform', columns='Hour', values='Engagement')
    .fillna(0)
)

all_hours = list(range(24))
heat_pivot = heat_pivot.reindex(columns=all_hours, fill_value=0)

# Use Plotly imshow to draw heatmap
fig = px.imshow(
    heat_pivot,
    x=heat_pivot.columns, # 0–23
    y=heat_pivot.index, # platform
    color_continuous_scale='Viridis',
    template='plotly_dark',
    labels={
        "x": "Hour of Day",
        "y": "Platform",
        "color": "Engagement"
    },
    title="Engagement Heatmap by Hour and Platform (24-hour)",
)

# Adjust figure size
fig.update_layout(
    height=500,
    width=900,
)
fig.show()
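The groupby-then-pivot sequence above can also be written as a single `pivot_table` call with `aggfunc='sum'` and `fill_value=0`; a sketch on toy engagement rows:

```python
import pandas as pd

d = pd.DataFrame({
    "Platform": ["Twitter", "Twitter", "Instagram"],
    "Hour": [8, 8, 9],
    "Engagement": [10, 5, 7],
})
mat = d.pivot_table(index="Platform", columns="Hour",
                    values="Engagement", aggfunc="sum", fill_value=0)

assert mat.loc["Twitter", 8] == 15   # two Twitter posts at hour 8 summed
assert mat.loc["Instagram", 8] == 0  # missing cell filled with 0
```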

Machine Learning¶

In [22]:
# Prepare Data
text_col = 'Text'
label_col = 'Sentiment_Category'
valid_labels = ['Positive', 'Negative', 'Neutral']
ml_df = df[[text_col, label_col]].dropna().copy()
ml_df = ml_df[ml_df[label_col].isin(valid_labels)]

print("Sample size:", len(ml_df))
display(ml_df.head())

X_text = ml_df[text_col].astype(str)
y = ml_df[label_col].astype(str)
Sample size: 732
                                              Text Sentiment_Category
0        Enjoying a beautiful day at the park! ...           Positive
1           Traffic was terrible this morning. ...           Negative
2          Just finished an amazing workout! 💪 ...           Positive
3  Excited about the upcoming weekend getaway! ...           Positive
4   Trying out a new recipe for dinner tonight. ...           Neutral

5 rows × 2 columns

In [23]:
# Train/Test Split + TF-IDF

# Stratify
X_train, X_test, y_train, y_test = train_test_split(
    X_text, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
print("Train size:", len(X_train), " Test size:", len(X_test))

# TF-IDF
vectorizer = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    stop_words='english'
)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf  = vectorizer.transform(X_test)
print("TF-IDF matrix shape:", X_train_tfidf.shape)
Train size: 585  Test size: 147
TF-IDF matrix shape: (585, 5000)
In [24]:
# Train Multiple Models
models = {
    "PassiveAggressive": PassiveAggressiveClassifier(max_iter=50, random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM": SVC(kernel='linear', random_state=42),
    "MultinomialNB": MultinomialNB()
}

results = {}

for name, clf in models.items():
    print("\n" + "="*50)
    print(f"Training model: {name}")
    clf.fit(X_train_tfidf, y_train)
    y_pred = clf.predict(X_test_tfidf)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name} Accuracy: {acc:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    
    results[name] = {"model": clf, "accuracy": acc, "y_pred": y_pred}
==================================================
Training model: PassiveAggressive
PassiveAggressive Accuracy: 0.7143
Classification Report:
              precision    recall  f1-score   support

    Negative       0.80      0.65      0.72        37
     Neutral       0.58      0.41      0.48        34
    Positive       0.72      0.88      0.79        76

    accuracy                           0.71       147
   macro avg       0.70      0.65      0.66       147
weighted avg       0.71      0.71      0.70       147


==================================================
Training model: LogisticRegression
LogisticRegression Accuracy: 0.6599
Classification Report:
              precision    recall  f1-score   support

    Negative       0.94      0.41      0.57        37
     Neutral       0.86      0.18      0.29        34
    Positive       0.61      1.00      0.76        76

    accuracy                           0.66       147
   macro avg       0.80      0.53      0.54       147
weighted avg       0.75      0.66      0.60       147


==================================================
Training model: RandomForest
RandomForest Accuracy: 0.6463
Classification Report:
              precision    recall  f1-score   support

    Negative       0.80      0.32      0.46        37
     Neutral       0.69      0.26      0.38        34
    Positive       0.62      0.97      0.76        76

    accuracy                           0.65       147
   macro avg       0.70      0.52      0.53       147
weighted avg       0.68      0.65      0.60       147


==================================================
Training model: SVM
SVM Accuracy: 0.7279
Classification Report:
              precision    recall  f1-score   support

    Negative       0.89      0.65      0.75        37
     Neutral       0.79      0.32      0.46        34
    Positive       0.68      0.95      0.79        76

    accuracy                           0.73       147
   macro avg       0.78      0.64      0.67       147
weighted avg       0.76      0.73      0.70       147


==================================================
Training model: MultinomialNB
MultinomialNB Accuracy: 0.6531
Classification Report:
              precision    recall  f1-score   support

    Negative       0.94      0.43      0.59        37
     Neutral       0.80      0.12      0.21        34
    Positive       0.61      1.00      0.76        76

    accuracy                           0.65       147
   macro avg       0.78      0.52      0.52       147
weighted avg       0.74      0.65      0.59       147
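The accuracies above come from a single 80/20 split of only 732 rows, so they can shift by a few points under a different `random_state`; k-fold cross-validation gives a steadier estimate. A sketch on synthetic data (the notebook itself does not do this):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

assert len(scores) == 5  # one accuracy per fold
assert all(0.0 <= s <= 1.0 for s in scores)
```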

In [25]:
# Model Accuracy Comparison

# Create accuracy table
acc_df = pd.DataFrame({
    "Model": list(results.keys()),
    "Accuracy": [results[m]["accuracy"] for m in results]
})

print("\nCompare Model Accuracy")
display(acc_df)

# Plot bar chart using Plotly
fig = px.bar(
    acc_df,
    x="Model", # Model name
    y="Accuracy",
    text="Accuracy",
    color="Accuracy",
    color_continuous_scale="Viridis",
    title="Accuracy Comparison of Models",
    template="plotly_dark"
)

# Update layout
fig.update_traces(
    texttemplate='%{text:.3f}',
    textposition='outside'
)

fig.update_layout(
    xaxis_title="Model",
    yaxis_title="Accuracy",
    yaxis=dict(range=[0, 1]),
    width=900,
    height=500
)
fig.show()
Compare Model Accuracy
                Model  Accuracy
0   PassiveAggressive  0.714286
1  LogisticRegression  0.659864
2        RandomForest  0.646259
3                 SVM  0.727891
4       MultinomialNB  0.653061

5 rows × 2 columns

In [26]:
# Confusion Matrix
# Select the model with the highest accuracy
best_name = max(results, key=lambda x: results[x]["accuracy"])
best_model = results[best_name]["model"]
best_pred = results[best_name]["y_pred"]
print(f"\nBest model selected: {best_name}")
labels_sorted = ['Negative', 'Neutral', 'Positive']

# Compute confusion matrix
cm = confusion_matrix(y_test, best_pred, labels=labels_sorted)

# Convert to DataFrame for Plotly
cm_df = pd.DataFrame(cm, index=labels_sorted, columns=labels_sorted)

# Plotly
fig = px.imshow(
    cm_df,
    text_auto=True, # Show numbers inside boxes
    color_continuous_scale='Blues',
    title=f"Confusion Matrix (Plotly) - {best_name}",
)

fig.update_layout(
    xaxis_title="Predicted Label",
    yaxis_title="True Label",
    width=900,
    height=600
)
fig.show()
Best model selected: SVM
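With imbalanced classes in the test set (76 Positive vs. 34 Neutral), raw counts in the matrix are hard to compare across rows; row-normalizing puts per-class recall on the diagonal. A sketch with toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ["Negative", "Neutral", "Neutral", "Positive", "Positive"]
y_pred = ["Negative", "Neutral", "Positive", "Positive", "Positive"]
labels = ["Negative", "Neutral", "Positive"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_norm = cm / cm.sum(axis=1, keepdims=True)  # each row now sums to 1

assert np.allclose(cm_norm.sum(axis=1), 1.0)
assert cm_norm[1, 1] == 0.5  # Neutral recall: 1 of 2 Neutral rows correct
```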
In [27]:
# Train Logistic Regression (for feature explanation)
log_clf = LogisticRegression(max_iter=1000, random_state=42)
log_clf.fit(X_train_tfidf, y_train)

# Get feature names from TF-IDF vectorizer
feature_names = vectorizer.get_feature_names_out()

# Logistic regression coefficient per class
coefs = log_clf.coef_
class_labels = log_clf.classes_
top_n = 15  # show the 15 strongest feature words per class

# Loop through each sentiment class
for idx, label in enumerate(class_labels):
    coef_for_class = coefs[idx]
    
    # Words with highest positive coefficients
    top_pos_idx = np.argsort(coef_for_class)[-top_n:][::-1]
    
    # Words with lowest coefficients
    top_neg_idx = np.argsort(coef_for_class)[:top_n]
    
    print("\n=========================================")
    print(f"Sentiment Class: {label}")
    
    print("\nTop Positive Words (strong indicators)")
    top_pos_words = [feature_names[i] for i in top_pos_idx]
    print(top_pos_words)
    
    print("\nTop Negative Words (opposite indicators)")
    top_neg_words = [feature_names[i] for i in top_neg_idx]
    print(top_neg_words)
=========================================
Sentiment Class: Negative

Top Positive Words (strong indicators)
['despair', 'loneliness', 'thoughts', 'grief', 'jealousy', 'injustice', 'resentment', 'lingers', 'lost', 'labyrinth', 'bad', 'like', 'heart', 'confusion', 'trust']

Top Negative Words (opposite indicators)
['new', 'beauty', 'laughter', 'excitement', 'exploring', 'serenity', 'friends', 'curiosity', 'nature', 'tales', 'concert', 'sky', 'ancient', 'gratitude', 'surprise']

=========================================
Sentiment Class: Neutral

Top Positive Words (strong indicators)
['serenity', 'curiosity', 'awe', 'knowledge', 'fulfillment', 'reverence', 'nostalgia', 'ambivalence', 'empowerment', 'arousal excitement', 'arousal', 'moonlit', 'mysteries', 'uncertainty', 'tales']

Top Negative Words (opposite indicators)
['day', 'heart', 'despair', 'surprise', 'loneliness', 'gratitude', 'just', 'inspiration', 'weekend', 'hopeful', 'friend', 'grief', 'morning', 'warmth', 'contentment']

=========================================
Sentiment Class: Positive

Top Positive Words (strong indicators)
['new', 'surprise', 'gratitude', 'laughter', 'friends', 'inspiration', 'weekend', 'hopeful', 'joy', 'creativity', 'pride', 'elation', 'contentment', 'euphoria', 'kindness']

Top Negative Words (opposite indicators)
['lost', 'serenity', 'despair', 'shattered', 'thoughts', 'loneliness', 'silent', 'curiosity', 'knowledge', 'echoes', 'night', 'awe', 'emotional', 'fulfillment', 'labyrinth']
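The argsort pattern used above generalizes to a small helper; a sketch with made-up feature names and weights (these are illustrative, not taken from the model):

```python
import numpy as np

def top_k_features(weights, feature_names, k=3):
    """Return the k feature names with the largest coefficients, descending."""
    idx = np.argsort(weights)[-k:][::-1]
    return [feature_names[i] for i in idx]

names = ["joy", "grief", "sunny", "rain"]
weights = np.array([2.0, -1.5, 0.7, -0.3])

assert top_k_features(weights, names) == ["joy", "sunny", "rain"]
```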
In [ ]: